Return-Path: <news@columbia.edu>
Received: from newsmaster.cc.columbia.edu (newsmaster.cc.columbia.edu [128.59.59.30])
        by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA25773
        for <kermit.misc@watsun.cc.columbia.edu>; Sat, 15 Jan 2000 16:25:30 -0500 (EST)
Received: (from news@localhost)
        by newsmaster.cc.columbia.edu (8.8.5/8.8.5) id QAA22251
        for kermit.misc@watsun.cc.columbia.edu; Sat, 15 Jan 2000 16:07:29 -0500 (EST)
X-Authentication-Warning: newsmaster.cc.columbia.edu: news set sender to <news> using -f
From: fdc@watsun.cc.columbia.edu (Frank da Cruz)
Subject: Case Study 8: Unicode
Date: 15 Jan 2000 21:07:28 GMT
Organization: Columbia University
Message-ID: <85qnig$ln8$1@newsmaster.cc.columbia.edu>
To: kermit.misc@columbia.edu

Who doesn't know what Unicode is?

Now that computing has become so widespread and Web-centric -- a revolution in itself -- we are on the brink of another major revolution in computing, one that will have profound effects on all of us and perhaps even on the future course of history.

Until now, most computer text has been recorded in single-byte 7-bit or 8-bit character sets (1), one per language or language group. For example, the default character set of the Web is ISO 8859-1 Latin Alphabet 1, which can encode English plus most West European languages: Italian, Spanish, German, Icelandic, etc. But it can't encode East European languages like Polish, Czech, or Hungarian, even though they use the same alphabet, because the accents are different. Nor can it represent languages like Russian, Arabic, Hebrew, or Japanese that use other writing systems.

Therefore, to write in languages other than our own we often have to switch character sets, and as anybody who has tried it can tell you, that's a tricky business. And it's even trickier if we need to mix different languages in the same document; for example, Portuguese, Romanian, Russian, and Armenian.

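To make the limitation concrete, here is a small Python sketch. It is an illustration only and has nothing to do with Kermit's own code: it relies on Python's built-in codecs, and the sample words and the helper name encodable are chosen here just for the demonstration. It checks which words fit in ISO 8859-1 (Latin-1) and ISO 8859-2 (Latin-2), and shows that a mixed-language string fits in neither but does fit in UTF-8.

```python
# Illustration only: uses Python's built-in codecs, not anything from Kermit.
# The sample words and the helper name are assumptions made for this sketch.
samples = {
    "Italian": "città",
    "Icelandic": "þjóð",
    "Polish": "łódź",
    "Russian": "мир",
}


def encodable(text: str, charset: str) -> bool:
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


for language, word in samples.items():
    print(language, word,
          "Latin-1:", encodable(word, "iso-8859-1"),
          "Latin-2:", encodable(word, "iso-8859-2"))

# A document that mixes the languages fits in neither Latin alphabet,
# but a Unicode encoding such as UTF-8 can hold all of it.
mixed = " ".join(samples.values())
print("mixed, Latin-1:", encodable(mixed, "iso-8859-1"))
print("mixed, Latin-2:", encodable(mixed, "iso-8859-2"))
print("mixed, UTF-8:  ", encodable(mixed, "utf-8"))
```
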
The great promise of the Internet is to bring people in all countries together as never before. We can get to know one another and appreciate each other's languages and cultures with unprecedented convenience. And the great lesson of mass computer and Internet culture so far is: for anything to catch on, it has to be easy. Coping with the current Babel of character sets is anything but easy: different platforms use different private character sets (such as PC code pages), which must map to any of an array of standard character sets (such as the ISO Latin alphabets) or to different private character sets on other platforms. If languages are to be mixed, elaborate and often product-specific switching mechanisms are required.

Unicode to the rescue. For more than 10 years, a consortium of corporate, academic, and standards-body representatives has been working to create a single universal character set capable of representing all the world's writing systems. To find out all about Unicode, visit the Unicode Consortium website:

  http://www.unicode.org/

Unicode marks a fundamental change in how we compute. Each character is represented not by a single byte (1), but by one, two, three, four, or more bytes, depending on the specific Unicode Transformation Format (UTF) used and the specific characters involved. But since we have fifty years of software written for the one-byte-per-character model, the transition to Unicode will be a long process. One, however, that is already well underway.

A major part of this transition is the creation of Unicode fonts. The work is being done piecemeal, with each font containing a (perhaps) different subset of Unicode, with additional characters and writing systems added over time. Your computer might already support Unicode to some extent. To check, visit:

  http://www.columbia.edu/kermit/utf8.html

This is a no-frills plain-text web page containing text in many languages (2) encoded in Unicode Transformation Format 8 (UTF-8).
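
As a rough illustration of how the byte count per character depends on the Transformation Format, the following Python sketch counts the bytes needed for a few characters in UTF-8 and UTF-16. The sample characters and the function name encoded_lengths are assumptions made for this example; the last sample lies outside the 16-bit range and is included only to show the four-byte case.

```python
# Illustration only; the sample characters are chosen for this sketch.
samples = ["A", "é", "Я", "あ", "𐍈"]


def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes one character occupies in UTF-8 and UTF-16."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # Big-endian form, so that no byte-order mark is counted.
        "utf-16": len(ch.encode("utf-16-be")),
    }


for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

ASCII letters stay at one byte in UTF-8, which is part of why a plain-text UTF-8 page can be served to software that was written for the one-byte-per-character model.
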
You might see a lot of "unknown glyph" boxes or gibberish, depending on your browser, font, and locale. Now visit:

  http://www.hclrss.demon.co.uk/unicode/fonts.html

for a survey of Unicode fonts to see how you might be able to widen the horizons of your own computer right now. Try installing an updated font and visiting the UTF-8 Sample Page again.

What you see marks a great leap forward: a vendor-neutral, application-independent method for encoding text in many languages -- and some day, we hope, all languages. Unlike other Web pages you might have seen, there are no tricks here -- for example, no GIFs to represent Chinese or Hebrew. It's just plain text. You can select and copy it like any other text, but whether you can paste it into another application depends on the other application. On Windows 95 and later, for example, you can paste it into Word with a Unicode font such as Arial or Times New Roman selected, and see several of the non-Roman scripts but not necessarily all of them.

The Kermit Project has been a member of the Unicode Consortium for years, and now C-Kermit 7.0 supports Unicode as a transfer character-set, a file character-set, and a terminal character-set. All of a sudden you have a convenient cross-platform tool for migration to Unicode and interfacing between Unicode and traditional environments. For example:

 . You can make a connection from a traditional environment to a Unicode platform (such as Plan 9) and have Kermit translate between your local character-set and Unicode during the terminal session. Or vice versa. (3)

 . You can send traditionally encoded text (say, Italian encoded in Latin-1 or Code Page 850) to a Unicode environment, and you can import Unicode text to your traditional environment.

 . You can convert local files from traditional character sets to Unicode, and vice versa.

 . You can convert between different Unicode Transformation Formats.

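The kinds of conversion listed above can be sketched in a few lines of Python. This is not how Kermit does it, and it does not show Kermit commands (for those, see Section 6.6 of ckermit2.txt); it only illustrates the idea of decoding bytes from one character set and re-encoding them in another. The function name transcode and the sample text are assumptions made for this example.

```python
# Illustration only: Python's built-in codecs, not Kermit's translation code.
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from one character set and re-encode them in another."""
    return data.decode(source).encode(target)


# Italian text as it would be stored in PC Code Page 850.
italian_cp850 = "Perché no?".encode("cp850")

# Traditional character set to traditional character set.
as_latin1 = transcode(italian_cp850, "cp850", "iso-8859-1")

# Traditional character set to Unicode (UTF-8).
as_utf8 = transcode(italian_cp850, "cp850", "utf-8")

# One Unicode Transformation Format to another (UTF-8 to big-endian UTF-16).
as_utf16 = transcode(as_utf8, "utf-8", "utf-16-be")

for name, data in [("CP850", italian_cp850), ("Latin-1", as_latin1),
                   ("UTF-8", as_utf8), ("UTF-16", as_utf16)]:
    print(name, len(data), "bytes:", data)
```

Going the other way, from Unicode to a traditional set, can fail when the text contains characters the target set lacks, so a real tool has to decide what to do with them.
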
C-Kermit's Unicode support is integrated with all its other character-set support, which covers:

 . English and West European (Latin-1) languages.
 . East European Roman-Alphabet (Latin-2) languages.
 . Russian, Ukrainian, and other languages written in Cyrillic.
 . Greek.
 . Hebrew.
 . Japanese.

Others can, and no doubt will, be added in the future.

All of this and more will be included in the forthcoming releases of Kermit 95. Most of what you see on the UTF-8 Sample Page, you will also be able to see on your Kermit 95 screen; it's "just" a matter of having the right font (4).

As usual, I've rambled on longer than planned and still only scratched the surface. For greater detail, read Section 6.6 of the ckermit2.txt file.

Notes:

(1) Oversimplification. Traditional East Asian character sets, among others, use various multibyte encodings.

(2) If you can add languages to this page, please let me know.

(3) To learn about Unicode support in Linux, visit

(4) A GUI window is required in Windows 95 and 98, but not in Windows NT or 2000.

- Frank